Abstract

We consider the problem of distributed dictionary 
learning, where a set of nodes is required to collec- 
tively learn a common dictionary from noisy measure- 
ments. This approach may be useful in several con- 
texts including sensor networks. Diffusion cooperation 
schemes have been proposed to solve the distributed 
linear regression problem. In this work we focus on a 
diffusion-based adaptive dictionary learning strategy: 
each node records observations and cooperates with its 
neighbors by sharing its local dictionary. The resulting 
algorithm corresponds to a distributed block coordi- 
nate descent (alternate optimization). Beyond dictio- 
nary learning, this strategy could be adapted to many 
matrix factorization problems and generalized to var- 
ious settings. This article presents our approach and 
illustrates its efficiency on some numerical examples. 

Keywords: dictionary learning, sparse coding, dis- 
tributed estimation, diffusion, matrix factorization, 
adaptive networks, block coordinate descent. 



1 Introduction 

In a variety of contexts, huge amounts of high dimen- 
sional data are recorded from multiple sensors. When 
sensor networks are considered, it is desirable that com- 
putations be distributed over the network rather than 
centralized in some fusion unit. Indeed, centralizing all 
measurements lacks robustness - a failure of the cen- 
tral node is fatal - and scalability due to the needed 
energy and communication resources. In distributed 
computing, every node communicates with its neigh- 
bors only and processing is carried out by every node 
in the network. Another important remark is that relevant
information from the data usually lives in a space
of much reduced dimension compared to the physical 
space. The extraction of this relevant information calls 
for the identification of some adapted sparse represen- 
tation of the data. Sparsity is an important property 
which favors the identification of the main components 
that are characteristic of the data. Each observation is 
then described by a sparse subset of atoms taken from 
a redundant dictionary. We study the problem of dic- 
tionary learning distributed over a sensor network in a 
setting where a set of nodes is required to collectively 
learn an adaptive sparse representation of independent 
observations. 

Learning an adaptive representation of the data is 
useful for many tasks such as storing, transmitting or 
analyzing the data to understand its content. A basic
dictionary can be obtained by using Principal
Component Analysis (PCA), also known as the Karhunen-Loève
decomposition in signal processing. However, the
number of atoms of such a decomposition is limited 
to the dimension of the data space. A more adapted 
representation is obtained by using a redundant dic- 
tionary where for instance no orthogonality property 
is imposed. While the number of potential character- 
istic sources is large, the number of effective sources 
which contribute to the signal observed by a sensor at 
a single moment is much smaller. Many recent works 
have shown the interest of learning a redundant dictio- 
nary allowing for a sparse representation of the data, 
see [TF11] for an up-to-date review. Furthermore, the
problem of dictionary learning belongs to the more gen- 
eral family of matrix factorization problems that ap- 
pear in a host of applications. 

In this paper, we consider the situation where a set 
of connected nodes independently record data from ob- 
servations of the same kind of physical system: each 
observation is assumed to be described by a sparse
representation using a common dictionary for all sensors.
For instance, a set of cameras observes the same kind of
scenes, or a set of microphones records the same kind
of sound environment.

The dictionary learning and the matrix factorization 
problems are connected to the linear regression problem.
Let us consider that a set of observations of the
system is described by a data matrix S where each
column corresponds to one observation. Assume that
S = DX. If the coefficients X (resp. the dictionary D)
are known, the estimation of the dictionary
(resp. the coefficients) from S is a linear regression
problem. Several recent works
have proposed efficient solutions to the problem of least
mean square (LMS) distributed linear regression, see
[CS10] and references therein. The main idea is to
use a so-called diffusion strategy: each node n carries
out its own estimation $D_n$ of the same underlying linear
regression vector D but can communicate with its
neighbors as well. The information provided to a
node by its neighbors is taken into account according
to some weights interpreted as diffusion coefficients.
Under some mild conditions, the performance of such
an approach in terms of mean squared error is similar
to that of a centralized approach [ZS12]. Denoting
by $D_c$ the centralized estimate which uses all the
observations at once, it can be shown that the error
$E\|D_n - D\|^2$ of the distributed estimate is of the same
order as $E\|D_c - D\|^2$: diffusion networks match the
performance of the centralized solution.

Our work gives strong indication that the classical
dictionary learning technique based on block coordinate
descent on the dictionary D and the coefficients
X can be adapted to the distributed framework
by adapting the diffusion strategy mentioned above.
Our numerical experiments also strongly support this
idea. The theoretical analysis is the subject of ongoing
work. Note that solving this type of matrix factorization
problem matters in practice since it corresponds
to many inverse problems: denoising, adaptive compression,
recommendation systems, etc. A distributed approach
is highly desirable both for use in sensor networks
and for parallelization of numerically expensive
learning algorithms.

The paper is organized as follows. Section 2 formulates
the problem we are considering. Section 3 recalls
dictionary learning techniques based on block
coordinate descent approaches. Section 4 presents the
diffusion strategy for distributed dictionary learning.
Section 5 shows some numerical experiments and results.
Section 6 summarizes the main claims and prospects.



2 Problem formulation 

Consider N nodes over some region. In the following, 
boldfaced letters denote column vectors, and capital 
letters denote matrices. Node n takes $q_n$ measurements
$\mathbf{y}_n(i)$, $1 \le i \le q_n$, from some physical system.
All the observations are assumed to originate
from independent realizations $\mathbf{s}_n(i)$ of the same underlying
stochastic source process $\mathbf{s}$. Each measurement
is corrupted by additive noise:



$$\mathbf{y}_n(i) = \mathbf{s}_n(i) + \mathbf{z}_n(i) \tag{1}$$



where $\mathbf{z}_n$ denotes the usual i.i.d. Gaussian noise with
covariance matrix $\Sigma_n = \sigma_n^2 I$. Our purpose is to learn
a redundant dictionary D which carries the characteristic
properties of the data. This dictionary must yield
a sparse representation of $\mathbf{s}$ so that:



$$\forall n, \quad \mathbf{y}_n(i) = \underbrace{D\,\mathbf{x}_n(i)}_{\mathbf{s}_n(i)} + \mathbf{z}_n(i) \tag{2}$$



where $\mathbf{x}_n(i)$ features the coefficients $x_{nk}(i)$ associated
with the contribution of atom $\mathbf{d}_k$, the k-th column of the
dictionary matrix D, to $\mathbf{s}_n(i)$. The sparsity of $\mathbf{x}_n(i)$
means that only a few components of $\mathbf{x}_n(i)$ are nonzero.
We consider the situation where a unique dictionary
D generates the observations at all nodes. In contrast,
observations will not be shared between
nodes (this would be one potential generalization). Our
purpose is to learn (estimate) this dictionary in a
distributed manner thanks to in-network computing only,
see Section 4. As a consequence, each node will
estimate a local dictionary $D_n$ thanks to i) its observations
$\mathbf{y}_n$ and ii) communication with its neighbors.
The neighborhood of node n will be denoted by $\mathcal{N}_n$,
including node n itself. The number of nodes connected
to node n is the degree $\nu_n$.



3 Dictionary learning strategies 

3.1 Problem formulation 

Various approaches to dictionary learning have been 
proposed [TF11]. Usually, in the centralized setting,
the q observations are denoted by $\mathbf{y}(i) \in \mathbb{R}^p$ and
grouped in a matrix $Y = [\mathbf{y}(1), \ldots, \mathbf{y}(q)]$. As a
consequence, $Y \in \mathbb{R}^{p \times q}$. The dictionary (associated with
some linear transform) is denoted by $D \in \mathbb{R}^{p \times K}$: each
column is one atom $\mathbf{d}_k$ of the dictionary. We gather
the coefficients associated with the observations in a single
matrix $X = [\mathbf{x}(1), \ldots, \mathbf{x}(q)]$. We will consider learning
methods based on block coordinate descent or alternate
optimization on D and X with a sparsity constraint on
X [Tse01, TF11].

The data is represented as the sum of a linear combination
of atoms and a noise term $Z \in \mathbb{R}^{p \times q}$:

$$Y = DX + Z \tag{3}$$



Since dictionary learning is a matrix factorization problem,
it is ill-posed. The dictionary is potentially
redundant and not necessarily orthogonal, so
that $K \gg p$. Some modeling is necessary to constrain
the set of possible solutions. Various conditions can
be considered (non-negative matrix factorization,
orthogonal decomposition, ...). In general, a dictionary
is considered as adapted to the data if each observation
$\mathbf{y}(i)$ can be described by a small number of coefficients
$\mathbf{x}(i)$. One usually searches for a sparse representation
and imposes the sparsity of X [TF11].
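As an illustration of model (3), the following sketch draws a synthetic data set Y = DX + Z with a redundant dictionary and sparse coefficients. The function name `make_sparse_data`, the sizes p, K, q, the number of active atoms per observation, the Laplace amplitudes and the noise level are all assumptions chosen for the example, not values taken from this paper.

```python
import numpy as np

def make_sparse_data(p=16, K=32, q=200, n_active=3, sigma=0.05, seed=0):
    """Draw Y = D X + Z with unit-norm atoms and n_active nonzeros per column of X."""
    rng = np.random.default_rng(seed)
    # Redundant dictionary (K > p) with unit-norm columns (atoms)
    D = rng.standard_normal((p, K))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    # Sparse coefficients: a few randomly chosen atoms per observation
    X = np.zeros((K, q))
    for i in range(q):
        support = rng.choice(K, size=n_active, replace=False)
        X[support, i] = rng.laplace(size=n_active)
    # Additive i.i.d. Gaussian noise
    Z = sigma * rng.standard_normal((p, q))
    return D @ X + Z, D, X

Y, D, X = make_sparse_data()
```

Each column of `Y` is then a noisy combination of at most `n_active` atoms, which is the kind of observation a dictionary learning method is expected to explain.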

3.2 Learning a redundant dictionary 
for sparse representation 

The properties of redundancy of the dictionary and 
sparsity of the coefficients are complementary. The 
extreme case would be the one where the dictionary 
contains each one of the true data underlying the noisy 
observation so that only one non zero coefficient in x(i) 
would be sufficient to describe the observation y(i). 
Then we would have K = q and X would be maximally
sparse (only 1 nonzero coefficient per observation). Of
course this dictionary would not be very interesting 
since its generalization power would be very limited. 
A good dictionary must offer a compromise between 
its fidelity to the learning data set and its ability to 
generalize. The choice of the size of the dictionary is 
often made a priori so that K > p to ensure some 
redundancy and K < q to ensure it can capture some 
general information shared by the data. For instance, 
when working on image patches of size 8 × 8 (the data
lives in dimension p = 64), it is typically proposed to
learn dictionaries of size 256 or 512 [AEB06, TF11].

In the classical setting, the noise is usually assumed
to be i.i.d. Gaussian so that the reconstruction
error is measured by the L2-norm. Sparsity of the coefficient
matrix is imposed through an L0 penalty relaxed to an L1
penalization in the mixed optimization problem:

$$(\hat{D}, \hat{X}) = \arg\min_{(D,X)} \frac{1}{2}\|Y - DX\|_2^2 + \lambda \|X\|_1 \tag{4}$$

Under some mild conditions, this problem is known to
provide a solution to the L0-penalized problem (ideally we
would prefer to directly solve the L0-penalized problem)
[SMF10].
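To make the cost of problem (4) concrete, here is a minimal sketch that evaluates it for given D and X. It assumes the quadratic term carries a factor 1/2 and that the L1 norm is the sum of the absolute values of all entries of X; the helper names `l1_objective` and `l0_count` are hypothetical.

```python
import numpy as np

def l1_objective(Y, D, X, lam):
    """Cost of problem (4): 0.5 * ||Y - D X||^2 + lam * sum of |X_kj|."""
    residual = Y - D @ X
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(X))

def l0_count(X, tol=0.0):
    """Number of nonzero coefficients, i.e. the L0 count that the L1 norm relaxes."""
    return int(np.sum(np.abs(X) > tol))
```

Monitoring `l1_objective` along the iterations is a simple way to check that an alternate optimization scheme behaves as expected, while `l0_count` reports the sparsity actually reached.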



3.3 Block coordinate descent 

One way to solve problem (4) is to use block coordinate
descent [Tse01], that is, alternate optimization on
X and D. There are several possibilities to do this, see
e.g. [AEB06]. For instance, after some initialization,
one may use gradient descents on X and D [OF96].
Such approaches are attractive since we know that linear
regression by gradient descent can be translated to
the distributed framework [CS10].

One possible choice is the Basis Pursuit algorithm.
At each step s, the forward-backward splitting (Basis
Pursuit Denoising with Iterated Soft Thresholding, see
[SMF10], p. 161) estimates X by iterating
the following steps over t:

1. $X^{(s,t+1/2)} = X^{(s,t)} + \mu\, D^{(s)T} [Y - D^{(s)} X^{(s,t)}]$
(gradient descent step)

2. $X^{(s,t+1)} = \mathrm{SoftThreshold}_{\lambda\mu}(X^{(s,t+1/2)})$
(soft thresholding step)
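The two steps above can be sketched as follows for a fixed dictionary. This is only an illustration under stated assumptions: the cost is taken as 0.5 * ||Y - DX||^2 + lam * ||X||_1, the step size is set to the conservative value 1/||D||_F^2, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise soft thresholding: shrink magnitudes by tau, set small entries to 0."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def ista_sparse_coding(Y, D, lam, n_iter=30, X0=None):
    """Iterated soft thresholding for min_X 0.5*||Y - D X||^2 + lam*||X||_1."""
    K, q = D.shape[1], Y.shape[1]
    X = np.zeros((K, q)) if X0 is None else X0.copy()
    # Conservative step size (assumption): 1 / ||D||_F^2
    mu = 1.0 / max(np.linalg.norm(D, "fro") ** 2, 1e-12)
    for _ in range(n_iter):
        X_half = X + mu * D.T @ (Y - D @ X)    # gradient descent step
        X = soft_threshold(X_half, lam * mu)   # soft thresholding step
    return X
```

Since the squared Frobenius norm of D bounds the largest eigenvalue of D^T D, this step size guarantees that the cost does not increase from one iteration to the next.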

Then we update $X^{(s+1)} = X^{(s,T)}$ after T (typically 30
or 40) iterations of soft thresholding. Note that one must
have $\mu \in (0, 2/\|D\|_F^2)$ where $\|\cdot\|_F$ denotes the Frobenius
norm. Then the dictionary $D^{(s+1)}$ can be updated
knowing $X^{(s+1)}$, using a simple gradient descent:

$$\hat{D}^{(s+1)} = D^{(s)} + \eta\, [Y - D^{(s)} X^{(s+1)}]\, X^{(s+1)T} \tag{5}$$

which tends to minimize $\|Y - DX\|_2^2$ with respect to
D for $0 < \eta < 2/\lambda_{\max}(X^T X)$ ($\lambda_{\max}$ stands for the
largest eigenvalue). The dictionary is then normalized:

$$\forall\, 1 \le k \le K, \quad \mathbf{d}_k \leftarrow \frac{1}{\|\mathbf{d}_k\|_2}\, \mathbf{d}_k \tag{6}$$
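A minimal sketch of the dictionary gradient update followed by the normalization of the atoms is given below. The default step size, half of the upper bound 2/lambda_max, is an assumption made for the example, and the function names are hypothetical.

```python
import numpy as np

def dictionary_gradient_step(Y, D, X, eta=None):
    """One gradient step on D for the cost 0.5 * ||Y - D X||^2."""
    if eta is None:
        # lambda_max(X X^T) equals lambda_max(X^T X); use half the admissible bound
        lam_max = np.linalg.eigvalsh(X @ X.T).max()
        eta = 1.0 / max(lam_max, 1e-12)
    return D + eta * (Y - D @ X) @ X.T

def normalize_atoms(D, eps=1e-12):
    """Rescale every column (atom) of D to unit Euclidean norm."""
    norms = np.linalg.norm(D, axis=0, keepdims=True)
    return D / np.maximum(norms, eps)   # eps guards against all-zero atoms
```

The two functions are kept separate because the gradient step alone decreases the quadratic cost, whereas the normalization only fixes the scale of the atoms.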



One may also use the Moore-Penrose pseudo-inverse following
the Method of Optimal Directions (MOD) [EAHH99]:

$$\hat{D}^{(s+1)} = \arg\min_D \|Y - D X^{(s+1)}\|_2^2 = Y X^{(s+1)T} \left( X^{(s+1)} X^{(s+1)T} \right)^{-1} \tag{7}$$



again followed by a normalization step. Other more sophisticated
methods have also been proposed, like FOCUSS
[MKD01], K-SVD [AEB06] or the majorization
method [YBD09]. We do not discuss all these methods
here for the sake of brevity. In the following, it appears
that methods rooted in the simple gradient descent update
are the easiest to adapt to the distributed diffusion
strategy. The comparison of the performance of the various
methods is under study.
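For comparison with the gradient update, the MOD update of eq. (7) can be sketched with the pseudo-inverse routine of NumPy. The function name `mod_update` is hypothetical, and the normalization of the atoms is included as described in the text.

```python
import numpy as np

def mod_update(Y, X, eps=1e-12):
    """MOD dictionary update: least-squares D for fixed X, then atom normalization."""
    # Y @ pinv(X) equals Y X^T (X X^T)^{-1} whenever X X^T is invertible
    D = Y @ np.linalg.pinv(X)
    norms = np.linalg.norm(D, axis=0, keepdims=True)
    return D / np.maximum(norms, eps)
```

Unlike the gradient step, this update requires a global least-squares solve on the whole data matrix, which is one reason why it is less straightforward to distribute.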



4 Distributed dictionary learning

4.1 Diffusion strategies for distributed 
estimation 

This section presents one particular effective diffusion 
strategy to solve LMS distributed estimation problems, 
see [CS10, ZS12] for a detailed presentation. Here we
focus on the Adapt-Then-Combine (ATC) strategy.

The ATC strategy aims at
solving the problem of scalar least mean squares linear
regression over a sensor network. Observations
$y_n(i)$ are assumed to arrive sequentially at consecutive
instants i. In the usual setting [CS10], each sensor
records both a noisy scalar measurement $y_n(i) \in \mathbb{R}$ and
a set of coefficients $\mathbf{x}_{n,i}$ under the assumption



$$y_n(i) = \mathbf{w}^{oT} \mathbf{x}_{n,i} + z_n(i). \tag{8}$$



The objective is to collectively estimate $\mathbf{w}^o$. The purpose
of ATC is that the sensors $\{n : 1 \ldots N\}$ yield estimates
$\mathbf{w}_n$ of the common underlying regression vector
$\mathbf{w}^o$ from observations $\{y_n(i); \mathbf{x}_{n,i}\}$ at time i. The cost
function under the assumption of Gaussian noise is:

$$J(\mathbf{x}, \mathbf{w}) = \sum_{n=1}^{N} \underbrace{E\,|y_n(i) - \mathbf{w}^T \mathbf{x}_{n,i}|^2}_{J_{loc}(\mathbf{x}_{n,i},\, \mathbf{w})} \tag{9}$$



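Let $A, C \in (\mathbb{R}^+)^{N \times N}$ be two matrices such that:

$$\begin{cases} c_{\ell,n} = a_{\ell,n} = 0 & \text{if } \ell \notin \mathcal{N}_n, \\ \mathbf{1}^T C = \mathbf{1}^T, \quad C\mathbf{1} = \mathbf{1}, \quad \mathbf{1}^T A = \mathbf{1}^T \end{cases} \tag{10}$$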



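where $\mathbf{1}$ is the column vector of ones. The ATC algorithm
consists of 2 steps:

$$\boldsymbol{\psi}_{n,i} = \mathbf{w}_{n,i-1} - \mu_n \sum_{\ell \in \mathcal{N}_n} c_{\ell,n} \nabla_{\mathbf{w}} J_{loc}(\mathbf{x}_{\ell,i}, \mathbf{w}_{n,i-1}) \quad \text{(Adapt)} \tag{11}$$

$$\mathbf{w}_{n,i} = \sum_{\ell \in \mathcal{N}_n} a_{\ell,n} \boldsymbol{\psi}_{\ell,i} \quad \text{(Combine)} \tag{12}$$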



The ATC algorithm can be seen as a distributed gradient
descent where each sensor tries to estimate $\mathbf{w}^o$ as
$\mathbf{w}_{n,i}$ by exploiting its own measurement $y_n(i)$ as well as
information shared with its neighbors. Eq. (11) is the
Adapt or incremental step, eq. (12) is the Combine or
diffusion step which averages estimates from neighbors
of node n. As a consequence, a local (possibly averaged
if $C \neq I$) gradient with respect to $\mathbf{w}$ is computed
at each node. An intermediate updated version of the
local estimate of $\mathbf{w}^o$, denoted by $\boldsymbol{\psi}_{n,i}$, is then obtained.
The final estimate at each node is a local average of
neighboring intermediate estimates.
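The Adapt and Combine steps can be sketched as follows for the scalar LMS setting with C = I. This is a minimal illustration, not the reference implementation of [CS10]: the function name `atc_lms`, the array layout and the common step size `mu` are assumptions, the expectation in the local cost is replaced by its instantaneous value, and the factor 2 of the gradient is absorbed into `mu`.

```python
import numpy as np

def atc_lms(x, y, A, mu=0.05):
    """ATC diffusion LMS with C = I.

    x: (T, N, M) regressors, y: (T, N) scalar measurements,
    A: (N, N) combination matrix whose columns sum to one, A[l, n] = a_{l,n}.
    Returns the (N, M) array of final local estimates w_n.
    """
    T, N, M = x.shape
    w = np.zeros((N, M))
    for i in range(T):
        # Adapt: instantaneous gradient (LMS) step at every node
        err = y[i] - np.sum(x[i] * w, axis=1)    # shape (N,)
        psi = w + mu * err[:, None] * x[i]       # shape (N, M)
        # Combine: w_n = sum over l of a_{l,n} * psi_l
        w = A.T @ psi
    return w
```

With a fixed step size the local estimates fluctuate around the common regression vector, with a residual error that grows with `mu` and with the noise level.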

In the sequel, we will focus on the case where observations
are not shared between nodes so that matrix
$C = (c_{\ell,n})$ is simply the identity, C = I. Various choices
can be considered for A. In the numerical experiments
below we typically work with either some a priori fixed
matrix A or with the relative degree variance:

$$a_{\ell,n} = \frac{\nu_\ell\, \sigma_\ell^{-2}}{\sum_{m \in \mathcal{N}_n} \nu_m\, \sigma_m^{-2}}, \quad \ell \in \mathcal{N}_n \tag{13}$$
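As an illustration of the relative degree variance rule (13), the sketch below builds A from an adjacency matrix and the local noise variances. It assumes that the degree of a node counts the node itself, consistently with the convention that the neighborhood of n includes n; the function name and argument layout are hypothetical.

```python
import numpy as np

def relative_degree_variance(adjacency, noise_var):
    """Combination matrix A with a_{l,n} proportional to nu_l / sigma_l^2 for l in N_n.

    adjacency: (N, N) symmetric 0/1 array with ones on the diagonal.
    noise_var: (N,) local noise variances sigma_n^2.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    degree = adjacency.sum(axis=0)                      # nu_n, node n included
    score = degree / np.asarray(noise_var, dtype=float)
    A = adjacency * score[:, None]                      # row l carries nu_l / sigma_l^2
    return A / A.sum(axis=0, keepdims=True)             # normalize column n over N_n
```

By construction every column of the returned matrix sums to one, and the entries are zero outside the neighborhoods, so the constraints on A in (10) hold.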



The performance analysis of this ATC diffusion
strategy and some other variants can be found in
[ZS12]. The mean-square error of the ATC estimate of
$\mathbf{w}^o$ is similar to that of the centralized version (which
would see all the observations at once). In conclusion,
this diffusion strategy is a powerful way to obtain
a distributed solution to a linear regression problem.
Let us emphasize that in this setting each observation
is made of a couple $(y_n, \mathbf{x}_n)$ where $y_n$ is a scalar.
In the dictionary learning problem, only the vector $\mathbf{y}_n$
will be observed and both the dictionary (therefore D
in place of $\mathbf{w}^o$) and the coefficients $\mathbf{x}_n$ are to be jointly
estimated: this is a factorization problem.

4.2 Distributed alternate optimization 
for dictionary learning 

The ATC diffusion strategy for distributed estimation
described above leads to the following approach to
distributed block-coordinate descent (alternate optimization)
for dictionary learning. We will mainly keep
the concept of diffusion to ensure communication between
nodes: every node will share its dictionary estimate
with its neighbors in $\mathcal{N}_n$. Let us point out some
differences between our setting and that of Section 4.1.
Observations are now vectors (no longer
scalars) $\mathbf{y}_n(i)$, $i = 1 \ldots q_n$, at node n. Observations are
taken simultaneously at each node, not sequentially, so
that a whole data matrix $Y_n$ is assumed to be available
at node n. From now on, index i stands for iterations. The
case where data arrive sequentially at each node can
also be dealt with at the price of a natural adaptation
of the present approach. Note that the $\mathbf{x}_{n,i}$ are
no longer known: each node must estimate both
its local dictionary $D_n$ and the coefficients $X_n$ which
describe the observations $Y_n = D_n X_n + Z_n$. At each iteration
i, only the local dictionary estimates $D_{n,i}$ are
assumed to be shared between neighbors, not observations,
so that C = I in eq. (10). The algorithm goes



Algorithm 1: ATC for sparse dictionary learning

Initialize $D_{n,0}$, $\forall n$ (see the text for various options).
Given a matrix A satisfying (10), i = 0,
Repeat until convergence of $(D_{n,i}, X_{n,i})_{n=1:N}$
  For each node n repeat:
    1) Optimization w.r.t. $X_{n,i}$ (sparse rep.):
       Given the dictionary, the coefficients are
       iteratively updated through
       For t = 1 : M (typically M = 30)
         i) $X^{(t+1/2)} = X^{(t)} + \mu_n^X D_{n,i}^T (Y_n - D_{n,i} X^{(t)})$
            (gradient descent step)
         ii) $X^{(t+1)} = \mathrm{SoftThreshold}_{\lambda_n \mu_n^X}(X^{(t+1/2)})$
       EndFor (t)
    2) Optimization w.r.t. $D_{n,i}$ (dictionary)
       $\psi_{n,i+1} = D_{n,i} + \mu_n^D (Y_n - D_{n,i} X_{n,i}) X_{n,i}^T$
       $D_{n,i+1} = \sum_{\ell \in \mathcal{N}_n} a_{\ell,n} \psi_{\ell,i+1}$ (diffusion)
  EndFor (n)
  i <- i + 1
EndRepeat



as follows. First the local dictionaries $D_{n,0}$ are initialized
to a random set of K observations (columns) from
$Y_n$ at node n. Then we iteratively solve the sparse
representation problem (4) at each node, for instance
using the forward-backward splitting method over a
large number M of iterations, see Section 3.3. The
penalty parameter $\lambda_n$ may be adjusted for each node
according to the local noise level $\sigma_n^2$. Note that the positive
learning rates $\mu_n^X$ (resp. $\mu_n^D$) must obey the condition
$\mu_n^X < 2/\|D_{n,i}\|_F^2$ (resp. $\mu_n^D < 2/\|X_{n,i}\|_F^2$).

In summary, each node updates its dictionary as a
function of its local observations $Y_n$ (Adapt step) and
its neighbors' dictionaries (Combine step). Sparse representations
are computed locally. Based on known
results for the ATC strategy in its usual setting, we
expect Algorithm 1 to converge to an
accurate estimate of the common underlying dictionary
D. The next section supports this intuition with numerical
experiments on images.
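The whole procedure can be sketched in a few lines. This is an illustrative reading of Algorithm 1 under several assumptions that are not stated in the text: the same penalty `lam` is used at every node, the coefficients are warm-started from one outer iteration to the next, the step sizes are set to the conservative values 1/||D||_F^2 and 1/||X||_F^2, a fixed number of outer iterations replaces the convergence test, and no atom normalization is applied. All function and parameter names are hypothetical.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise soft thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def atc_dictionary_learning(Ys, A, K, lam=0.1, n_outer=50, n_ista=30, seed=0):
    """Sketch of Algorithm 1. Ys: list of (p, q_n) local data matrices; A: (N, N)."""
    rng = np.random.default_rng(seed)
    N = len(Ys)
    # Initialize each D_n with K randomly chosen local observations (requires K <= q_n)
    Ds = [Y[:, rng.choice(Y.shape[1], K, replace=False)].copy() for Y in Ys]
    Xs = [np.zeros((K, Y.shape[1])) for Y in Ys]
    for i in range(n_outer):
        psis = []
        for n in range(N):
            D, Y, X = Ds[n], Ys[n], Xs[n]
            # 1) Sparse coding by iterated soft thresholding
            mu_x = 1.0 / max(np.linalg.norm(D, "fro") ** 2, 1e-12)
            for t in range(n_ista):
                X = soft_threshold(X + mu_x * D.T @ (Y - D @ X), lam * mu_x)
            Xs[n] = X
            # 2) Adapt: local gradient step on the dictionary
            mu_d = 1.0 / max(np.linalg.norm(X, "fro") ** 2, 1e-12)
            psis.append(D + mu_d * (Y - D @ X) @ X.T)
        # Combine (diffusion): D_n = sum over l of a_{l,n} * psi_l
        Ds = [sum(A[l, n] * psis[l] for l in range(N)) for n in range(N)]
    return Ds, Xs
```

Only the intermediate dictionaries `psis` cross node boundaries in this sketch; the data matrices and the sparse codes stay local, which is the point of the diffusion approach.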



5 Numerical experiments & results

We present some numerical experiments to illustrate
the relevance and efficiency of our approach. In the
spirit of the seminal work by Olshausen & Field [OF96],
our algorithm was tested on datasets containing controlled
forms of sparse structure. We consider a set of
r × r image patches composed of sparse pixels. Each
pixel was activated independently according to an exponential
distribution, $P(x) \propto e^{-|x|}$.

Figure 1: (a) Examples of patches, (b) true dictionary
used for synthesis, (c) & (e) local dictionaries for 2 different
nodes, (d) dictionary learnt from the usual centralized
learning, (f) dictionary averaged over nodes'
estimates. Atoms have been reordered to make comparisons
easier.

We consider the simple situation of a set of 4 nodes 
in a symmetrically connected network. Thus we used 
a symmetric matrix A such that: 



$$A = \begin{pmatrix} 0.6 & 0.2 & 0 & 0.2 \\ 0.2 & 0.6 & 0.2 & 0 \\ 0 & 0.2 & 0.6 & 0.2 \\ 0.2 & 0 & 0.2 & 0.6 \end{pmatrix} \tag{14}$$



Note that the nodes are not all directly connected to
one another.
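The properties of this matrix can be checked numerically. The sketch below assumes that the entries left blank in (14) are zeros, which corresponds to a ring of four nodes; node indices are 0-based in the code.

```python
import numpy as np

# Combination matrix of (14), assuming the blank entries are zeros (ring of 4 nodes)
A = np.array([
    [0.6, 0.2, 0.0, 0.2],
    [0.2, 0.6, 0.2, 0.0],
    [0.0, 0.2, 0.6, 0.2],
    [0.2, 0.0, 0.2, 0.6],
])

ones = np.ones(4)
is_symmetric = np.allclose(A, A.T)
columns_sum_to_one = np.allclose(ones @ A, ones)   # 1^T A = 1^T as required by (10)
# Neighborhood N_n of each node: indices l with a nonzero weight a_{l,n}
neighborhoods = [set(np.flatnonzero(A[:, n]).tolist()) for n in range(4)]
```

Under this assumption each node exchanges its dictionary with two neighbors only, so information from the opposite node reaches it in two diffusion steps.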

Fig. 1 shows that all the nodes have consistently
learnt the same dictionary. Let us emphasize that these
dictionaries are consistent, in the sense that no local
reordering was necessary at any step. All the nodal
dictionaries $D_n$ are close to the same common dictionary
D. It appears that the mean-square error over all
estimates is similar to that obtained from the centralized
dictionary learning procedure of Section 3. Therefore,
even though each node locally solves a matrix factorization
problem from a particular disjoint subset of
observations, the same common dictionary is (approximately)
identified. This is made possible by the diffusion
principle which relies on simple communication
between neighbors only.



6 Conclusion & Prospects

In conclusion, we have presented an original algorithm
which solves the problem of distributed dictionary
learning over a sensor network. This is made possible
by a diffusion strategy which permits some
local communication between neighbors. Connected 
nodes can exchange their local dictionaries which are 
estimated from disjoint subsets of data. This algo- 
rithm is the adaptation of usual dictionary learning 
techniques for sparse representation to the context of 
in-network computing. Some numerical experiments 
illustrate the relevance of our approach. The theoret- 
ical study of the algorithm is the subject of ongoing 
work. Several improvements and generalizations can 
also be considered. Many methods are available for 
sparse coding. This choice is crucial to get better dic- 
tionary estimates. We will study which method is most 
adapted to this distributed setting. The optimization 
of communication coefficients may be of some help as 
well. 

We believe that this approach to the general prob- 
lem of distributed matrix factorization opens the way 
towards many prospects and applications. Moreover, 
as far as computational complexity is concerned, dis- 
tributed parallel implementations are a potentially 
interesting alternative to online learning techniques 